Systems and methods for simultaneous translation with integrated anticipation and controllable latency (STACL)

ABSTRACT

Presented herein are embodiments of a prefix-to-prefix framework for simultaneous translation that implicitly learns to anticipate in a single translation model. Within this framework are effective “wait-k” policy model embodiments that may be trained to generate a target sentence concurrently with a source sentence but lag behind by a predefined number of words. Embodiments of the prefix-to-prefix framework achieve low latency and reasonable translation quality compared to full-sentence translation in four directions: Chinese↔English and German↔English. Also presented herein is a novel latency metric that addresses deficiencies of previous latency metrics.

CROSS REFERENCE TO RELATED PATENT APPLICATIONS

The present application claims priority benefit, under 35 U.S.C. §119(e), to commonly-assigned U.S. Patent Application No. 62/738,367, filed on Sep. 28, 2018, entitled “Predictive Simultaneous Translation with Arbitrary Latency Constraints,” listing Mingbo Ma, Liang Huang, Hao Xiong, Chuanqiang Zhang, Zhongjun He, Kaibo Liu, Hairong Liu, Xing Li, and Haifeng Wang as inventors, which application is herein incorporated by reference in its entirety. Each reference mentioned in this patent document is incorporated by reference herein in its entirety.

BACKGROUND

The present disclosure relates generally to systems and methods for interpretation automation. More particularly, the present disclosure relates to systems and methods for simultaneous translation with integrated anticipation and controllable latency.

Simultaneous translation aims to automate simultaneous interpretation, which translates concurrently with the source-language speech, with a delay of only a few seconds. This additive latency is much more desirable than the multiplicative 2× slowdown in consecutive interpretation. With this appealing property, simultaneous interpretation has been widely used in many scenarios, including multilateral organizations (United Nations/European Union) and international summits (Asia-Pacific Economic Cooperation/G20). However, due to the concurrent comprehension and production in two languages, it is extremely challenging and exhausting for humans: the number of qualified simultaneous interpreters worldwide is very limited, and each can perform for only about 15-30 minutes in one turn, as error rates grow exponentially after just minutes of interpreting. Moreover, limited memory forces human interpreters to routinely omit source content. Therefore, there is a critical need for simultaneous machine translation techniques that reduce the burden on human interpreters and make simultaneous translation more accessible and affordable.

Unfortunately, simultaneous translation is also notoriously difficult for machines, due in large part to the diverging word order between the source and target languages. For example, to simultaneously translate from a Subject-Object-Verb (SOV) language, such as Japanese or German, to a Subject-Verb-Object (SVO) language, such as English or Chinese (technically, German is SOV+V2 in main clauses and SOV in embedded clauses, and Mandarin is a mix of SVO and SOV), one has to know, i.e., wait for, the source language verb. As a result, existing so-called “real-time” translation systems resort to conventional full-sentence translation, causing an undesirable latency of at least one sentence.

Noticing the importance of verbs in SOV-to-SVO translation, some approaches attempt to reduce latency by explicitly predicting the sentence-final German or English verbs, which is limited to this particular case, or unseen syntactic constituents, which requires incremental parsing on the source sentence. Some have also proposed translating at an optimized sentence-segment level to improve translation accuracy.

Others use a two-stage model whose base model is a full-sentence model. On top of the full-sentence model, the two-stage model uses a READ/WRITE (R/W) model to decide, at every step, whether to wait for another source word (READ) or to emit a target word using the pre-trained base model (WRITE). This R/W model is trained by reinforcement learning to prefer (rather than enforce) a specific latency, without updating the base model. However, such approaches all have three major limitations in common: (a) they cannot achieve any predetermined latency, such as, e.g., a 3-word delay; (b) their base translation model is still trained on full sentences; and (c) their systems are overcomplicated, involving many components (such as a pre-trained model, prediction, and reinforcement learning), and are difficult to train.

Therefore, it would be desirable to have simultaneous translation systems and methods that integrate anticipation and translation.

BRIEF DESCRIPTION OF THE DRAWINGS

References will be made to embodiments of the invention, examples of which may be illustrated in the accompanying figures. These figures are intended to be illustrative, not limiting. Although the invention is generally described in the context of these embodiments, it should be understood that it is not intended to limit the scope of the invention to these particular embodiments. Items in the figures may not be to scale.

FIG. 1 illustrates a wait-k model according to various embodiments of the present disclosure.

FIG. 2 is a different illustration of the wait-k model shown in FIG. 1.

FIG. 3 is a comparison between a common sequence-to-sequence (seq-to-seq) framework and a prefix-to-prefix framework, according to embodiments of the present disclosure.

FIG. 4 illustrates tail beam search according to various embodiments of the present disclosure.

FIG. 5 is a flowchart of an illustrative process for using a neural network that has been trained in a prefix-to-prefix manner for low-latency real-time translation, according to various embodiments of the present disclosure.

FIG. 6 is a flowchart of an illustrative process for using a neural network that has been trained in a full-sentence manner for low-latency real-time translation in a prefix-to-prefix manner, according to various embodiments of the present disclosure.

FIG. 7A illustrates how a wait-2 policy renders a user increasingly out of sync with a speaker.

FIG. 7B illustrates how a wait-2 policy with catchup, according to various embodiments of the present disclosure, shrinks the tail and stays closer to the ideal diagonal, thus reducing effective latency.

FIG. 8 is a flowchart of an illustrative process for preventing a translation delay from increasing over time, according to various embodiments of the present disclosure.

FIG. 9A and FIG. 9B illustrate an Average Lagging latency metric according to various embodiments of the present disclosure.

FIG. 10 is a flowchart of an illustrative process for measuring how much a user is out of sync with a speaker, according to various embodiments of the present disclosure.

FIG. 11A and FIG. 11B illustrate Bilingual Evaluation Understudy (BLEU) score and AP comparisons with the model by Gu et al. (2017) (Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O. K. Li. 2017. “Learning to translate in real-time with neural machine translation,” in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, Apr. 3-7, 2017, Volume 1: Long Papers, pages 1053-1062, aclanthology.info/papers/E17-1099/e17-1099) for different wait-k models on German-to-English (FIG. 11A) and English-to-German (FIG. 11B) translation.

FIG. 12A and FIG. 12B illustrate BLEU scores for wait-k models on German-to-English (FIG. 12A) and English-to-German (FIG. 12B) with latency measured by Average Lagging (AL), according to various embodiments of the present disclosure.

FIG. 13A and FIG. 13B illustrate BLEU scores and AL comparisons with different wait-k models on Chinese-to-English (FIG. 13A) and English-to-Chinese (FIG. 13B) translations on a development (dev) set, according to various embodiments of the present disclosure.

FIG. 14A and FIG. 14B illustrate translation quality against latency metrics on German-to-English simultaneous translation, showing wait-k models, test-time wait-k results, full-sentence baselines, and a reimplementation of Gu et al. (2017), all based on the same Transformer, according to various embodiments of the present disclosure.

FIG. 15A and FIG. 15B illustrate translation quality against latency metrics on English-to-German simultaneous translation, according to various embodiments of the present disclosure.

FIG. 16A and FIG. 16B illustrate translation quality against latency on zh→en.

FIG. 17A and FIG. 17B illustrate translation quality against latency on en→zh.

FIG. 18-FIG. 23 illustrate real running examples that have been generated from the introduced model(s) and baseline framework to demonstrate the effectiveness of the disclosed systems, according to various embodiments of the present disclosure.

FIG. 24 depicts a simplified block diagram of a computing device/information handling system, according to various embodiments of the present disclosure.

FIG. 25A-FIG. 25B depict respective tables 2A and 2B showing performance data of wait-k with Transformer, its catchup version, and wait-k with RNN models with various k on the de↔en and zh↔en test sets, according to various embodiments of the present disclosure.

DETAILED DESCRIPTION OF EMBODIMENTS

In the following description, for purposes of explanation, specific details are set forth in order to provide an understanding of the invention. It will be apparent, however, to one skilled in the art that the invention can be practiced without these details. Furthermore, one skilled in the art will recognize that embodiments of the present invention, described below, may be implemented in a variety of ways, such as a process, an apparatus, a system, a device, or a method on a tangible computer-readable medium.

Components, or modules, shown in diagrams are illustrative of exemplary embodiments of the invention and are meant to avoid obscuring the invention. It shall also be understood throughout this discussion that components may be described as separate functional units, which may comprise sub-units, but those skilled in the art will recognize that various components, or portions thereof, may be divided into separate components or may be integrated together, including being integrated within a single system or component. It should be noted that functions or operations discussed herein may be implemented as components. Components may be implemented in software, hardware, or a combination thereof.

Furthermore, connections between components or systems within the figures are not intended to be limited to direct connections. Rather, data between these components may be modified, re-formatted, or otherwise changed by intermediary components. Also, additional or fewer connections may be used. It shall also be noted that the terms “coupled,” “connected,” or “communicatively coupled” shall be understood to include direct connections, indirect connections through one or more intermediary devices, and wireless connections.

Reference in the specification to “one embodiment,” “preferred embodiment,” “an embodiment,” or “embodiments” means that a particular feature, structure, characteristic, or function described in connection with the embodiment is included in at least one embodiment of the invention and may be in more than one embodiment. Also, the appearances of the above-noted phrases in various places in the specification are not necessarily all referring to the same embodiment or embodiments.

The use of certain terms in various places in the specification is for illustration and should not be construed as limiting. A service, function, or resource is not limited to a single service, function, or resource; usage of these terms may refer to a grouping of related services, functions, or resources, which may be distributed or aggregated. Furthermore, the terms memory, database, information base, data store, tables, hardware, and the like may be used herein to refer to a system component or components into which information may be entered or otherwise recorded.

Further, it shall be noted that: (1) certain steps may optionally be performed; (2) steps may not be limited to the specific order set forth herein; (3) certain steps may be performed in different orders; and (4) certain steps may be done concurrently.

Any headings used herein are for organizational purposes only and shall not be used to limit the scope of the description or the claims. All documents cited herein are incorporated by reference herein in their entirety.

Furthermore, it shall be noted that many embodiments described herein are given in the context of audio recordings, but one skilled in the art shall recognize that the teachings of the present disclosure are not limited to audio applications and may equally be used to create and consolidate video content and may also be extended to include classification of objects or people in the video, motion, location, time, and other parameters.

A. Overview

In this document, “word” refers to any part of a word or a language token from which a meaning may be derived. The term “simultaneous” means “in real-time” as understood by a person of skill in the relevant art, e.g., a simultaneous interpreter; i.e., simultaneous is not limited to the common meaning of “exactly at the same time.”

Presented herein are very simple yet effective solutions that take advantage of a novel prefix-to-prefix framework that predicts target words using, e.g., only prefixes of the source sentence. Within this framework is presented a simple wait-k policy whose translation is, e.g., always k words behind the input. Consider the Chinese-to-English example in FIG. 1 and FIG. 2, where the sentence-final Chinese verb huìwù (“meet”) needs to be translated earlier to avoid a long delay. A wait-2 model correctly anticipates the English verb given only the first 4 Chinese words (which provide enough clues for this prediction given many similar prefixes in the training data). The presented embodiments make the following contributions:

(1) a prefix-to-prefix framework tailored to simultaneous translation and trained from scratch without requiring full-sentence models. The framework seamlessly integrates implicit anticipation and translation in a single model by directly predicting target words without first predicting source words and then translating the source words into target words;

(2) a special case “wait-k” policy that can satisfy any latency requirements;

(3) the presented strategies may be applied to most sequence-to-sequence models, e.g., with relatively minor changes; an application is demonstrated on Recurrent Neural Network (RNN) and Transformer models;

(4) a new latency metric, called “Average Lagging,” that addresses deficiencies of previous metrics; and

(5) experiments demonstrating that the strategy achieves low latency and reasonable BLEU scores (compared to full-sentence translation baselines) in four directions: Chinese↔English and German↔English.

B. Preliminaries: Full-Sentence Neural Machine Translation (NMT)

The following brief review of standard (full-sentence) neural translation sets up some of the notation.

Regardless of the particular design of different seq-to-seq models, the encoder always takes the input sequence x=(x₁, . . . , x_(n)), where each x_(i)∈ℝ^(d_x) is a word embedding of d_x dimensions, and produces a new sequence of hidden states h=ƒ(x)=(h₁, . . . , h_(n)). The encoding function ƒ can be implemented by an RNN or a Transformer.

On the other hand, a (greedy) decoder predicts the next output word y_(t) given the source sequence (actually its representation h) and previously generated words, denoted y_(<t)=(y₁, . . . , y_(t−1)). The decoder stops when it emits an end-of-sentence signal (e.g., <eos>), and the final hypothesis y=(y₁, . . . , <eos>) has a probability

p(y|x)=Π_(t=1)^(|y|) p(y_(t)|x, y_(<t))  (Eq. 1)

At training time, the conditional probability of each ground-truth target sentence y* may be maximized given input x over the whole training data D, or equivalently, the following loss may be minimized:

ℓ(D)=−Σ_((x,y*)∈D) log p(y*|x)  (Eq. 2)
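As a concrete illustration of Eq. 1, the following minimal Python sketch performs greedy full-sentence decoding and accumulates the hypothesis probability in log space. The callback step_distribution, which returns a probability distribution over the target vocabulary given the full source x and the target prefix, is a hypothetical stand-in for the RNN or Transformer decoder described below; it is not part of the disclosed embodiments.

    import math

    def greedy_decode(step_distribution, x, eos="<eos>", max_len=100):
        # Greedy full-sentence decoding: at each step t, pick the most
        # probable next word given the *entire* source x (contrast with
        # the prefix-to-prefix decoding of Section C).
        y, log_p = [], 0.0
        while len(y) < max_len:
            dist = step_distribution(x, y)                 # p(. | x, y_<t)
            word, p = max(dist.items(), key=lambda kv: kv[1])
            y.append(word)
            log_p += math.log(p)                           # Eq. 1 in log space
            if word == eos:
                break
        return y, log_p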

C. Prefix-to-Prefix and Wait-k Policy

In full-sentence translation, discussed above, each y_(i) is predicted using the entire source sentence x. But in simultaneous translation, one translates concurrently with the (growing) source sentence. Therefore, certain embodiments enable the design of a new prefix-to-prefix architecture to (be trained to) predict words in a target sentence by using a source prefix.

1. Prefix-to-Prefix Architecture

Let g(t) be a monotonic non-decreasing function of t that denotes the number of source words processed by an encoder when deciding the target word y_(t). For example, in FIG. 1 and FIG. 2, g(3)=4, i.e., a 4-word Chinese prefix is used to predict the target word y₃=“met.” In embodiments, the source prefix (x₁, . . . , x_(g(t))) rather than the whole input x may be used to predict the target word y_(t):

p(y_(t)|x_(≤g(t)), y_(<t))

Therefore, the decoding probability may be expressed as:

p_(g)(y|x)=Π_(t=1)^(|y|) p(y_(t)|x_(≤g(t)), y_(<t))  (Eq. 3)

and given training D, the training objective may be expressed as:

ℓ_(g)(D)=−Σ_((x,y*)∈D) log p_(g)(y*|x)  (Eq. 4)

Generally speaking, g(t) may be used to represent any arbitrary policy. In two special cases, g(t) may be constant: (a) g(t)=|x|: baseline full-sentence translation; (b) g(t)=0: an “oracle” that does not rely on any source information. It is noted that in any case, 0≤g(t)≤|x| for all t.

In embodiments, the “cut-off” step, τ_(g)(|x|), may be defined as the decoding step when the source sentence finishes, e.g., as:

τ_(g)(|x|)=min{t|g(t)=|x|}  (Eq. 5)
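A direct transcription of Eq. 5 in Python may look as follows; the function name and the policy-as-callable signature g(t, src_len) are illustrative choices, not part of the disclosed embodiments:

    def cutoff_step(g, src_len, max_steps=1000):
        # Eq. 5: the first decoding step t at which the policy has
        # consumed the entire source sentence, i.e., g(t) = |x|.
        for t in range(1, max_steps + 1):
            if g(t, src_len) >= src_len:
                return t
        raise ValueError("policy never consumes the full source")

    # The full-sentence baseline g(t) = |x| finishes reading at step 1.
    assert cutoff_step(lambda t, n: n, 7) == 1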

For example, in FIG. 1 and FIG. 2, which illustrate a wait-k model according to various embodiments of the present disclosure, the cut-off step is 6, i.e., the Chinese sentence finishes right before y₆=“in.” In FIG. 1, the wait-k model outputs each target word y_(t) given the source-side prefix x₁ . . . x_(t+k−1), often before seeing the corresponding source word (here k=2, outputting y₃=“met” before seeing x₇=huìwù, “meet”). Without anticipation, a 5-word wait 110 is needed.

FIG. 2 is a different illustration of the wait-k model shown in FIG. 1. FIG. 2 highlights the step of outputting the English verb “met,” which corresponds to the sentence-final Chinese verb huìwù. Unlike a simultaneous translator without anticipation, which would have to wait 5 words, the wait-k policy (here k=2) translates concurrently with the source sentence, but k words behind. The model correctly predicts the English verb given just the first 4 Chinese words (in bold), which literally translate to “Bush president in Moscow.”

While most existing approaches in simultaneous translation might be viewed as special cases of the presented framework, only their decoders are prefix-to-prefix, and their training still relies on a full-sentence-based approach. In other words, existing approaches use a full-sentence translation model to perform simultaneous decoding, which is a mismatch between training and testing. In contrast, various embodiments train a model to predict using source prefixes.

In embodiments, prefix-to-prefix training implicitly learns anticipation and, advantageously, overcomes word-order differences, such as SOV→SVO. Using the example in FIG. 1 and FIG. 2, in embodiments, anticipation of the English verb is enabled because the training data comprise numerous prefix-pairs of the form (X zài Y . . . , X met . . . ). Therefore, although the prefix x_(≤4), “Bùshí zǒngtǒng zài Mòsīkē” (literally meaning “Bush president in Moscow”), does not contain a verb, the prefix nevertheless provides sufficient clues to predict the verb “met.”

2. Wait-k Policy

As an example within the prefix-to-prefix framework, a wait-k policy is presented that, in embodiments, first waits for k source words and then translates concurrently with the rest of the source sentence, i.e., the output is always k words behind the input. This is similar to human simultaneous interpretation, which generally starts a few seconds into the speaker's speech and ends a few seconds after the speaker finishes.

FIG. 3 is a comparison between a common seq-to-seq framework and a prefix-to-prefix framework according to embodiments of the present disclosure. The prefix-to-prefix framework example shows a wait-2 policy as an example. As demonstrated in the example in FIG. 3, assuming k=2, the first target word may be predicted using the first 2 source words, the second target word may be predicted using the first 3 source words, etc. More formally, its g(t) may be defined as:

g_(wait-k)(t)=min{k+t−1, |x|}  (Eq. 6)
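Eq. 6 is compact enough to transcribe directly. The following illustrative Python sketch also verifies, for the FIG. 1 example (k=2, a 7-word source), that the cut-off step of Eq. 5 is 6:

    def g_wait_k(t, k, src_len):
        # Eq. 6: at decoding step t, wait-k has seen min(k + t - 1, |x|)
        # source words.
        return min(k + t - 1, src_len)

    # k = 2 with a 7-word source, as in FIG. 1:
    assert g_wait_k(3, 2, 7) == 4    # y_3 is predicted from a 4-word prefix
    assert g_wait_k(5, 2, 7) == 6    # not yet finished at step 5
    assert g_wait_k(6, 2, 7) == 7    # cut-off step (Eq. 5) is 6 = |x| - k + 1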

In embodiments, for this policy, the cut-off step τ_(g)_(wait-k)(|x|) is |x|−k+1. From this step on, g_(wait-k)(t) may be fixed to |x|, i.e., the remaining target words (including the word at this step) may be generated using the full source sentence. This part of the output, y_(≥|x|−k+1), may be referred to as the “tail,” discussed in greater detail with reference to FIG. 8.

In embodiments, beam search may be performed on the tail (referred to herein as “tail beam search”), but all earlier words may be generated greedily one by one. FIG. 4 illustrates tail beam search according to various embodiments of the present disclosure. As shown in FIG. 4, tail beam search may occur after the entire source sentence is finished. A general prefix-to-prefix policy, however, may use beam search whenever g(t)=g(t−1), i.e., when predicting more than one word using the same input prefix (e.g., the tail in wait-k).

Implementation details further below describe two exemplary implementations of the general prefix-to-prefix policy using RNN and Transformer as the underlying models.

FIG. 5 is a flowchart of an illustrative process for using a neural network that has been trained in a prefix-to-prefix manner for low-latency real-time translation, according to various embodiments of the present disclosure. In embodiments, process 500 starts by using a neural network that has been trained in a prefix-to-prefix manner to receive a source language token (505). The neural network may be trained by using a sequence of source language words that is shorter than a sentence and one or more previously generated target language words to predict some or all target language words corresponding to the sentence.

In embodiments, the source language tokens may be used as a prefix (510) that is shorter than a complete sentence to predict a target language token.

In response to receiving a next source language token (515), the prefix may be updated, and the updated prefix and one or more previously predicted target language tokens may be used to predict (520) a next target language token that is then output (525).

Finally, in response to receiving an end-of-sentence signal, substantially all source language tokens in a sentence may be used to generate (530) any remaining target language tokens, e.g., at once.
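The steps of process 500 can be pictured as a simple decoding loop. The following Python sketch is one hypothetical way to organize that loop for the wait-k special case; predict_next, which returns one target token given the current source prefix and target prefix, stands in for a prefix-to-prefix trained model and is an assumption, not a disclosed API.

    def wait_k_decode(predict_next, source_stream, k, eos="<eos>", max_len=200):
        # Steps 505-525: consume source tokens one by one; once k tokens
        # have arrived, emit one target token per new source token.
        src, tgt = [], []
        for token in source_stream:
            src.append(token)
            if len(src) < k:
                continue                      # still waiting (wait-k)
            tgt.append(predict_next(src, tgt))
            if tgt[-1] == eos:
                return tgt
        # Step 530: source finished; generate the remaining "tail" tokens.
        while tgt[-1:] != [eos] and len(tgt) < max_len:
            tgt.append(predict_next(src, tgt))
        return tgt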

FIG. 6 is a flowchart of an illustrative process for using a neural network that has been trained in a full-sentence manner for low-latency real-time translation in a prefix-to-prefix manner, according to various embodiments of the present disclosure. In embodiments, process 600 starts by training (602) a neural network to generate a set of corresponding target language tokens based on a set of source language tokens that represent a complete sentence.

The neural network is used to receive a first set (605) of source language tokens associated with a sentence.

One or more of the first set of source language tokens are used as a prefix (610) to predict a first set of target language tokens, wherein the prefix is shorter than the sentence.

In response to receiving a second set of source language tokens, the prefix is updated (615) and used together with one or more previously predicted target language tokens to predict a second set (620) of target language tokens and output (625) one or more target language tokens. Finally, responsive to receiving an end-of-sentence signal, substantially all source language tokens in the sentence are used to generate (630) any remaining target language tokens at once.

Test-Time Wait-k. As an example of the test-time prefix-to-prefix implementation discussed in the above subsection, various embodiments implement a “test-time wait-k” method, i.e., using a full-sentence model but decoding it with a wait-k policy. Experiments demonstrate that an embodiment of this method, without the anticipation capability, performs worse than implementations that utilize a genuine wait-k policy when k is small, but gradually improves in accuracy, and eventually both methods approach the full-sentence baseline (k=∞).

D. Refinement: Wait-k with Catchup

As previously mentioned, the wait-k decoding lags k words behind the incoming source stream. In the ideal case where the input and output sentences have equal length, the translation finishes k steps after the source sentence finishes, i.e., the tail length is also k. This is consistent with human interpretation, which starts and stops a few seconds after the speaker starts and stops.

However, input and output sentences generally have different lengths. In some directions, such as from Chinese to English, the target side is oftentimes significantly longer than the source side, with an average ground-truth tgt/src ratio, r=|y*|/|x|, of about 1.25. In this case, if the vanilla wait-k policy is followed, the tail length will be 0.25|x|+k, which increases with input length. For example, given a 20-word Chinese input sentence, the tail of the wait-3 policy will be 8 words long, i.e., almost half of the source length. This has two main negative effects:

(a) as decoding progresses, the user will effectively lag further and further behind (with each Chinese word practically translating to 1.25 English words), thus rendering the user more and more out of sync with the speaker, as illustrated by FIG. 7A for a wait-2 policy (the diagonal line denotes an ideal, i.e., perfect, synchronization); and (b) once a source sentence finishes, the rather long tail is displayed all at once, thus causing a cognitive burden on the user. In one or more embodiments, the tail may in principle be displayed concurrently with the first k words of the next sentence, but the tail is now much longer than k. Such negative effects worsen for longer input sentences. To address this problem, certain embodiments utilize a “wait-k+catchup” policy, such that the user is still k words behind the input in terms of real information content, i.e., k source words behind the ideal perfect synchronization policy denoted by the diagonal line in FIG. 7B.

FIG. 7B illustrates how a wait-2 policy with catchup according to various embodiments of the present disclosure shrinks the tail and stays closer to the ideal diagonal, thereby reducing the effective latency. Arrows 502 and 504 illustrate respective 2- and 4-word lags behind the diagonal line. For example, assuming that the tgt/src ratio is r=1.25, then 5 target words may be output for every 4 source words; i.e., the catchup frequency, denoted as c=r−1, is 0.25.

More formally, using catchup frequency c, the new policy may be expressed as:

g_(wait-k,c)(t)=min{k+t−1−⌊ct⌋, |x|}  (Eq. 7)

and the decoding and training objectives may change accordingly. It is noted that, in embodiments, the model may be trained to catch up using this new policy.

On the other hand, when translating from longer source sentences to shorter targets, e.g., from English to Chinese, it is possible that the decoder finishes generation before the encoder sees the entire source sentence, thus ignoring the “tail” on the source side. Therefore, in embodiments, “reverse” catchup is employed, i.e., catching up on the encoder instead of the decoder. For example, in English-to-Chinese translation, one extra word may be encoded every 4 steps, i.e., encoding 5 English words per 4 Chinese words. In this case, the “decoding” catchup frequency c=r−1=−0.2 is negative, but Eq. 7 still holds. It is noted that any arbitrary c, e.g., c=0.341, may be used where the catchup pattern is not as simple as “1 in every 4 steps,” but still maintains roughly a frequency of c catchups per source word.
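A minimal sketch of Eq. 7, using only the Python standard library; the floor handles arbitrary (including negative) catchup frequencies c:

    import math

    def g_wait_k_catchup(t, k, c, src_len):
        # Eq. 7: wait-k with catchup frequency c = r - 1. A positive c
        # makes the decoder emit extra words (catching up on the decoder);
        # a negative c encodes extra words (reverse catchup on the encoder).
        return min(k + t - 1 - math.floor(c * t), src_len)

    # With r = 1.25 (c = 0.25), the decoder emits roughly 5 target words
    # for every 4 source words, staying about k words behind the diagonal.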

E. New Latency Metric: Average Lagging

Besides translation quality, latency is another crucial aspect for evaluating simultaneous translation. Existing latency metrics are reviewed next, and their limitations are highlighted. Then, a new latency metric that addresses these limitations is introduced.

1. Existing Metrics: CW and AP

Consecutive Wait (CW) commonly denotes the number of source words waited between two target words. Based on the notation herein, for a policy g(⋅), the per-step CW at step t is

CW_(g)(t)=g(t)−g(t−1)

The CW of a sentence pair (x, y) is the average CW over all consecutive wait segments:

${CW}_{g}(x,y) = \frac{\sum_{t=1}^{|y|} {CW}_{g}(t)}{\sum_{t=1}^{|y|} \mathbf{1}_{{CW}_{g}(t)>0}} = \frac{|x|}{\sum_{t=1}^{|y|} \mathbf{1}_{{CW}_{g}(t)>0}}$

In other words, CW measures the average length of consecutive wait segments (the best case is 1 for word-by-word translation, or wait-1, and the worst case is |x| for full-sentence MT). The drawback of CW is its insensitivity to the actual lag behind the speaker, as discussed in the previous section; for example, catchup has no effect on CW.
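For concreteness, the CW formula above can be computed directly from a policy trace, i.e., the list [g(1), . . . , g(|y|)], as in this illustrative Python sketch (assuming the policy consumes the full source, so the waits sum to |x|):

    def consecutive_wait(gs, src_len):
        # gs = [g(1), ..., g(|y|)], with g(0) taken as 0. The numerator
        # telescopes to g(|y|) = |x|; the denominator counts the steps
        # that actually waited for new source words.
        waits = sum(1 for prev, cur in zip([0] + gs[:-1], gs) if cur > prev)
        return src_len / waits

    # Word-by-word wait-1 on a 4-word sentence: CW = 1 (best case).
    assert consecutive_wait([1, 2, 3, 4], 4) == 1.0
    # Full-sentence translation: CW = |x| (worst case).
    assert consecutive_wait([4, 4, 4, 4], 4) == 4.0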

Another existing latency measurement, Average Proportion (AP), measures the proportion of the shaded area for a policy in FIG. 7:

${AP}_{g}(x,y) = \frac{1}{|x|\,|y|}\sum_{t=1}^{|y|} g(t)$  (Eq. 8)

AP has two major flaws. First, it is sensitive to input length. For example, consider the wait-1 policy: when |x|=|y|=1, AP is 1; when |x|=|y|=2, AP is 0.75; and AP approaches 0.5 as |x|=|y|→∞. However, in all these cases, there is a one-word delay, so AP is not fair between long and short sentences. Second, being expressed as a percentage, the actual delay in number of words is not obvious to the user.
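The length sensitivity just described is easy to verify numerically with a direct transcription of Eq. 8 (an illustrative sketch, using the same policy-trace representation as above):

    def average_proportion(gs, src_len):
        # Eq. 8: AP = (1 / (|x| |y|)) * sum_t g(t).
        return sum(gs) / (src_len * len(gs))

    # The same one-word delay (wait-1) yields very different AP values:
    assert average_proportion([1], 1) == 1.0        # |x| = |y| = 1
    assert average_proportion([1, 2], 2) == 0.75    # |x| = |y| = 2
    # ... and AP tends to 0.5 as |x| = |y| grows.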

FIG. 8 is a flowchart of an illustrative process for preventing a translation delay from increasing over time, according to various embodiments of the present disclosure. Process 800 comprises training a prefix-to-prefix neural network to adjust (805) the difference between a number of target and source language tokens to keep their ratio about the same. In embodiments, this may be accomplished by adding or subtracting, on average, a constant number of source language tokens to prevent a translation delay from increasing over time.

In embodiments, the ratio may be inverted, e.g., when interpreting in a reverse direction (810).

2. New Metric: Average Lagging

Based on the concept of “lagging behind the ideal policy” discussed with respect to FIGS. 7A and 7B, a new metric, called “average lagging” (AL), is introduced and illustrated in FIG. 9A for the simple case when |x|=|y| and FIG. 9B for the more general case when |x|≠|y|.

In embodiments, AL may be used to quantify the degree to which a user is out of sync with a speaker, in terms of the number of source words. For simplicity, FIG. 9A shows a special case when |x|=|y|. The thick line indicates a “wait-0” policy where the decoder is one word ahead of the encoder. This policy may be defined as having an AL of 0. Policy 602, 604, 902, 904 indicates a “wait-1” policy where the decoder lags one word behind the wait-0 policy. In this case, the policy's AL may be defined as 1. Policy 212, 614, 912, 914 indicates a “wait-4” policy where the decoder lags 4 words behind the wait-0 policy, so its AL is 4. It is noted that in both cases, only steps up to (and including) the cut-off point are counted (indicated by horizontal arrows 630, 631 and 640, 641, respectively, i.e., 10 and 7, respectively) because the tail may be generated instantly without further delay. More formally, for the ideal case where |x|=|y|, one may define

${AL}_{g}(x,y) = \frac{1}{\tau_{g}(|x|)}\sum_{t=1}^{\tau_{g}(|x|)}\left(g(t) - (t-1)\right)$  (Eq. 9)

and infer that the AL for wait-k is exactly k.

In more realistic cases, such as the case represented by FIG. 9B when |x|<|y|, as explained with respect to FIG. 7, more and more delays may accumulate as the target sentence grows. For example, the wait-1 policy 904 in FIG. 9B has a delay of more than 3 words at its cut-off step 10, and the wait-4 policy 914 has a delay of almost 6 words at its cut-off step 7. This difference is mainly caused by the tgt/src ratio: in FIG. 9B, there are 1.3 target words per source word. More generally, the “wait-0” policy may be offset and one may redefine:

${AL}_{g}(x,y) = \frac{1}{\tau_{g}(|x|)}\sum_{t=1}^{\tau_{g}(|x|)}\left(g(t) - \frac{t-1}{r}\right)$  (Eq. 10)

where τ_(g)(|x|) denotes the cut-off step, and r=|y|/|x| is the target-to-source length ratio. One can observe that wait-k with catchup has an AL of k.
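Eq. 10 likewise admits a short, direct transcription (Eq. 9 is the special case r=1). As before, the policy trace gs = [g(1), g(2), . . .] is assumed to reach |x| at some step; the function is an illustrative sketch:

    def average_lagging(gs, src_len, tgt_len):
        # Eq. 10: average number of source words the decoder lags behind
        # the ideal (diagonal) policy, counted up to the cut-off step.
        r = tgt_len / src_len
        tau = next(t for t, g in enumerate(gs, 1) if g >= src_len)  # Eq. 5
        return sum(gs[t - 1] - (t - 1) / r for t in range(1, tau + 1)) / tau

    # Wait-3 with |x| = |y| = 6: gs = [3, 4, 5, 6, ...]; AL is exactly 3.
    assert average_lagging([3, 4, 5, 6, 6, 6], 6, 6) == 3.0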

F. Implementation Details

Exemplary implementation details for training prefix-to-prefix with RNNand Transformer are described next.

1. Background: Full-Sentence RNN

The (unidirectional) RNN encoder maps a sequence x into a sequence of hidden states:

$\vec{h}_{i} = \mathrm{RNN}(x_{i}, \vec{h}_{i-1}; \theta_{e})$

The list of hidden states h then represents the source side. The decoder may take another RNN to generate the target-side hidden representation at decoding step t:

$\vec{s}_{t} = \mathrm{RNN}(\vec{s}_{t-1}, h; \theta_{d})$  (Eq. 11)

2. Training Simultaneous RNN

Unlike full-sentence translation, in simultaneous translation embodiments, the source words may be fed into the encoder one by one. For decoding, Eq. 11 may be modified to predict using the source prefix:

$\vec{s}_{t} = \mathrm{RNN}(\vec{s}_{t-1}, h_{\leq g(t)}; \theta_{d})$
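The following PyTorch sketch illustrates this prefix-restricted decoding for a wait-k policy. It is a toy illustration under simplifying assumptions: random weights, a mean over the prefix states in place of a learned attention over h_(≤g(t)), and illustrative dimensions; it is not the OpenNMT-based implementation used in the experiments.

    import torch
    import torch.nn as nn

    emb_dim, hid, k = 8, 16, 2
    enc = nn.GRU(emb_dim, hid)          # unidirectional: h_i sees only x_<=i
    dec_cell = nn.GRUCell(hid, hid)

    x = torch.randn(5, 1, emb_dim)      # embeddings of a 5-word source
    h, _ = enc(x)                       # all prefix states in one pass
    s = torch.zeros(1, hid)
    for t in range(1, 4):               # predict y_1..y_3 under wait-2
        g_t = min(k + t - 1, x.size(0))          # Eq. 6
        context = h[:g_t].mean(dim=0)            # stand-in for attention over h_<=g(t)
        s = dec_cell(context, s)                 # prefix-restricted Eq. 11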

3. Background: Full-Sentence Transformer

First, the Transformer architecture is briefly reviewed step by step to highlight the differences between the conventional Transformer and simultaneous Transformer embodiments. The encoder of the Transformer works in a self-attention fashion, takes an input sequence x, and produces a new sequence of hidden states z=(z₁, . . . , z_(n)), where z_(i)∈ℝ^(d_z) is computed as follows:

z_(i)=Σ_(j=1)^(n) α_(ij) P_(W_V)(x_(j))  (Eq. 12)

Here, P_(W_V)(⋅) is a projection function from the input space to the value space, and α_(ij) denotes the attention weights:

$\alpha_{ij} = \frac{\exp e_{ij}}{\sum_{l=1}^{n}\exp e_{il}}, \qquad e_{ij} = \frac{P_{W_Q}(x_{i})\,P_{W_K}(x_{j})^{T}}{\sqrt{d_{x}}}$  (Eq. 13)

where e_(ij) measures the similarity between inputs.

Here, P_(W_Q)(x_(i)) and P_(W_K)(x_(j)) project x_(i) and x_(j) to the query and key spaces, respectively.

Embodiments herein may use 6 layers of self-attention and use h to denote the top-layer output sequence (i.e., the source context).

On the decoder side, during training time, the ground-truth output sequence y*=(y₁*, . . . , y_(m)*) may go through the same self-attention to generate the hidden self-attended state sequence c=(c₁, . . . , c_(m)). It is noted that because decoding is incremental, e_(ij) may be set to 0 if j>i in Eq. 13 to restrict self-attention to previously generated words.

In embodiments, in each layer, after gathering the hidden representations for each target word through self-attention, target-to-source attention may be performed:

c_(i)′=Σ_(j=1)^(n) β_(ij) P_(W_V′)(h_(j))

Similar to self-attention, β_(ij) measures the similarity between h_(j) and c_(i) as in Eq. 13.

4. Training Simultaneous Transformer

In embodiments, simultaneous translation feeds the source words incrementally to the encoder, but a naive implementation of such an incremental encoder/decoder may be inefficient. A faster implementation is described below.

For the encoder, during training time, an entire sentence may be fed at once to the encoder. But unlike the self-attention layer in the conventional Transformer (Eq. 13), in embodiments, each source word may be constrained to attend only to its predecessors (similar to decoder-side self-attention), effectively simulating an incremental encoder:

$\alpha_{ij}^{(t)} = \begin{cases} \dfrac{\exp e_{ij}^{(t)}}{\sum_{l=1}^{g(t)} \exp e_{il}^{(t)}} & \text{if } i, j \leq g(t) \\ 0 & \text{otherwise} \end{cases} \qquad e_{ij}^{(t)} = \begin{cases} \dfrac{P_{W_Q}(x_{i})\,P_{W_K}(x_{j})^{T}}{\sqrt{d_{x}}} & \text{if } i, j \leq g(t) \\ 0 & \text{otherwise} \end{cases}$

Then, in embodiments, the newly defined hidden state sequence z^((t))=(z₁^((t)), . . . , z_(n)^((t))) at decoding step t may be expressed as:

z_(i)^((t))=Σ_(j=1)^(n) α_(ij)^((t)) P_(W_V)(x_(j))  (Eq. 14)

When a new source word is received, all previous source words should adjust their representations.
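In practice, the prefix constraint above can be realized with a boolean attention mask per decoding step, so that one training pass over the full sentence covers all prefixes. The PyTorch sketch below is illustrative only (it applies the common softmax-masking trick of using −∞ scores, which is equivalent to restricting the sums to j≤g(t)):

    import torch

    def prefix_mask(n, g_t):
        # Position i may attend to position j only if both lie within the
        # first g(t) source words, simulating an incremental encoder.
        mask = torch.zeros(n, n, dtype=torch.bool)
        mask[:g_t, :g_t] = True
        return mask

    scores = torch.randn(6, 6)                   # raw e_ij for a 6-word source
    g_t = 4                                      # e.g., wait-3 at step t = 2
    masked = scores.masked_fill(~prefix_mask(6, g_t), float("-inf"))
    alpha = torch.softmax(masked[:g_t], dim=-1)  # rows i > g(t) are unused at step t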

G. Experiments

It shall be noted that these experiments and results are provided by way of illustration and were performed under specific conditions using a specific embodiment or embodiments; accordingly, neither these experiments nor their results shall be used to limit the scope of the disclosure of the current patent document.

This section first presents the accuracy and latency of the introduced wait-k model. Then, it is demonstrated that the catchup model reduces the latency even further with little or no sacrifice of accuracy. Finally, some examples from the dev set and from recent news are analyzed.

The performance of the various models is demonstrated on four simultaneous translation directions: Chinese↔English and German↔English. For the training data, the parallel corpora available from the Workshop on Statistical Machine Translation (WMT15) are used for German↔English translation (4.5 million sentence pairs), and the National Institute of Standards and Technology (NIST) corpus is used for Chinese↔English translation (2 million sentence pairs). First, byte-pair encoding (BPE) is applied to all texts in order to reduce the vocabulary size for both source and target sides. Then, sentence pairs longer than 50 and 256 words are excluded for English-to-German and Chinese-to-English, respectively. For German↔English evaluation, newstest-2013 (dev) is used as the development set and newstest-2015 (test) is used as the test set, with 3,000 and 2,169 sentence pairs, respectively. For Chinese↔English evaluation, NIST 2006 and NIST 2008 are used; they contain 616 and 691 Chinese sentences, respectively, each with four English references. In the catchup experiments, a decoding catchup frequency of c=0.25, which is derived from the dev set tgt/src length ratio of 1.25, is used. For the de↔en translation task, no catchup is used since the tgt/src ratio is almost 1.

When translating from Chinese to English, 4-reference BLEU scores are reported; in the reverse direction, the second among the four English references is used as the source text, and 1-reference BLEU scores are reported.

The implementation is adapted from the PyTorch-based Open-Source Neural Machine Translation (OpenNMT) toolkit. The Transformer's parameters are the same as the base model's parameter settings in the original paper (Vaswani et al., 2017, “Attention is all you need,” in Advances in Neural Information Processing Systems 30).

FIG. 10 is a flowchart of an illustrative process for measuring how much a user is out of sync with a speaker, according to various embodiments of the present disclosure. Process 1000 begins when a decoding step (1005) at which a source sentence finishes is determined. Then, a number of words in the source sentence at the decoding step is determined (1010). Finally, the number of words in the source sentence at the decoding step is used as a measure (1015) of how much a decoder is out of sync with an encoder. In embodiments, this measure is representative of how much a user is out of sync with a speaker.

1. Performance of Wait-k Model

In FIG. 11A and FIG. 11B, the BLEU score and AP are compared with the model from Gu et al. (2017) on the dev set for German-to-English and English-to-German tasks. In FIG. 11A and FIG. 11B, 702 and 704 represent full-sentence baselines with RNNs (greedy decoding and beam search with beam size 11, respectively). Line plots 720 and 730 represent the wait-k policy's greedy and tail beam search results with RNNs. Point pairs are the results from Gu et al. (2017) using greedy decoding and beam search (beam size 5) with models trained with various delay targets: 706, 708: full-sentence, 740, 741: CW=8, 750, 751: CW=5, 760, 761: CW=2, 770, 771: AP=0.3, 780, 781: AP=0.5, 790, 791: AP=0.7. It is noted that Gu et al.'s models trained with AP=0.5 achieve a test-time AP around 0.7 (de→en) and 0.66 (en→de).

The results indicate that the RNN-based model according to various embodiments outperforms the model from Gu et al. (2017) in both translation directions, and the simultaneous Transformer according to various embodiments achieves much better performance.

FIG. 12A and FIG. 12B illustrate BLEU scores for wait-k models on German-to-English (FIG. 12A) and English-to-German (FIG. 12B) with latency measured by AL. The BLEU score is compared together with AL between RNN- and Transformer-based models. Also included are AL values of one model of Gu et al. (2017) in each direction, based on the decoded action sequences provided by the authors of Gu et al. (2017).

802, 804 and 806, 808 are greedy decoding and beam-search baselines for Transformer and RNN models, respectively. Similarly, 830 and 832 are decoded using the greedy strategy, while 820 and 822 are decoded with tail beam search. 810: AP=0.5 and 850: AP=0.7 are the same points as in FIG. 11A and FIG. 11B.

The performance between Chinese and English is shown in FIG. 13A and FIG. 13B, which illustrate BLEU scores and AL comparisons with different wait-k models on Chinese-to-English (FIG. 13A) and English-to-Chinese (FIG. 13B) translations on the dev set. Note that 4-ref BLEU is used for Chinese-to-English but 1-ref BLEU is used for English-to-Chinese, since multiple references are only available on the English side. 902, 904 and 906, 908 are greedy decoding and beam-search baselines. The difference between wait-k and wait-k with decoder catchup is compared in FIG. 13A for Chinese-to-English translation. For the English-to-Chinese direction, FIG. 13B shows wait-k with encoder catchup, since the source side is much longer than the target side.

CW measures the average source segment length and is also compared inTable 1.

TABLE 1
Comparison with Gu et al. (2017) on the dev sets using CW and BLEU score. At similar or higher BLEU levels, the disclosed models enjoy much lower CWs.

               k = 3   k = 4   k = 5   k = 6   Gu et al.
de→en  CW       1.35    1.43    1.54    1.65    3.85
       BLEU    18.54   19.78   20.53   21.23   20.70
en→de  CW       1.13    1.22    1.33    1.48    3.36
       BLEU    15.40   16.41   17.24   17.56   15.93

As analyzed in Sec. E, wait-k has a CW close to 1. With similar or better BLEU scores, the CWs are much lower than those of Gu et al. (2017), which indicates a better user experience.

More comprehensive comparisons on the test sets are shown in Tables 2A and 2B (FIGS. 25A and 25B), which show performance data of wait-k with Transformer, its catchup version, and wait-k with RNN models with various k on the de↔en and zh↔en test sets, according to various embodiments of the present disclosure. For each k, the numbers on the left side are from greedy decoding, and the right, italic-font numbers are from tail beam search. ∞ represents the baseline with results from greedy and beam search.

2. Quality and Latency of Wait-k Model

TABLE 3

           Test
Train    k = 1   k = 2   k = 3   k = 5   k = 7   k = ∞
k′ = 1    34.3    31.5    31.2    31.1    30.4    19.2
k′ = 3    34.9    36.2    37.2    37.7    37.3    19.5
k′ = 5    30.4    36.8    30.8    38.9    39.0    24.3
k′ = 7    30.6    36.6    38.6    39.4    39.1    23.1
k′ = 9    27.4    34.7    38.5    39.9    40.6    27.4
k′ = ∞    26.2    32.7    36.9    39.3    41.0    43.7

Table 3 shows the results of a model according to various embodiments of the present disclosure that is trained with wait-k′ and decoded with wait-k (where ∞ means full-sentence). The disclosed wait-k results lie on the diagonal, and the last row is the “test-time wait-k” decoding. It is noted that good results of wait-k decoding may be achieved using a model that has been trained with a slightly larger k′.

FIG. 14-FIG. 17 plot translation quality (in BLEU) against latency (in AP and CW) for full-sentence baselines, wait-k, test-time wait-k (using full-sentence models), and a reimplementation of Gu et al. (2017) on the same Transformer baseline, according to various embodiments of the present disclosure. ★★: full-sentence (greedy and beam-search); Gu et al. (2017): ▪: AP=0.7. Note that their model trained with AP=0.7 achieves a test-time AP of 0.8 and CW of 7.8.

FIG. 14A and FIG. 14B illustrate translation quality against latency metrics (AP and CW) on German-to-English simultaneous translation, showing wait-k models (for k=1, 3, 5, 7, 9), test-time wait-k results, full-sentence baselines, and a reimplementation of Gu et al. (2017), all based on the same Transformer, according to various embodiments of the present disclosure.

FIG. 15A and FIG. 15B illustrate translation quality against latency metrics on English-to-German simultaneous translation, according to various embodiments of the present disclosure.

FIG. 16A and FIG. 16B illustrate translation quality against latency on zh→en. Gu et al. (2017): AP=0.3, ▾: AP=0.5, ▪: AP=0.7, according to various embodiments of the present disclosure.

FIG. 17A and FIG. 17B illustrate translation quality against latency on en→zh. Gu et al. (2017): AP=0.3, ▾: AP=0.5, ▪: AP=0.7, according to various embodiments of the present disclosure.

As FIG. 14 through FIG. 17 show, as k increases, (a) wait-k improves in BLEU score and worsens in latency, and (b) the gap between test-time wait-k and wait-k decreases. Eventually, both wait-k and test-time wait-k approach the full-sentence baseline as k→∞, consistent with intuition.

Next, the results are compared with the reimplementation of Gu et al. (2017)'s two-staged full-sentence model + reinforcement learning on the Transformer. On BLEU-vs.-AP plots, the two-staged full-sentence models perform similarly to test-time wait-k for de↔en and zh↔en, and slightly better than test-time wait-k for en→zh, which is reasonable as both use a full-sentence model at the core. However, on BLEU-vs.-CW plots, the two-staged full-sentence models have much worse CWs, which is consistent with results published by Gu et al. This is because the R/W model prefers consecutive segments of READs and WRITEs (e.g., the two-staged full-sentence model often produces sequences such as R R R R R W W W W R R R W W W W R . . . ), while various embodiments using wait-k translate concurrently with the input (the initial segment has length k, and the others have length 1, thus resulting in a relatively lower CW). It is noted that training of the two-staged full-sentence models was found to be relatively unstable due to the use of RL, whereas the presented embodiments were very robust.

3. Examples and Discussion

FIGS. 18-23 showcase some translation examples. The figures illustrate real running examples that have been generated from the introduced model(s) and baseline framework to demonstrate the effectiveness of the disclosed systems. Shown are the encoding step number and the source language (and pinyin when translating from Chinese) with its gloss in the upper side. Different generation results with different wait-k models and baselines are shown in the lower part of the tables in FIGS. 18-23. It is noted that the baseline method, which starts generating words after the entire source sentence is encoded, is in the last row, while the disclosed model(s) only wait k encoding steps.

FIG. 18 shows a German-to-English example in the dev set with anticipation. The main verb in the embedded clause, “einigen” (agree), is correctly predicted 3 words ahead of time (with “sich” providing a strong hint), while the auxiliary verb “kann” (can) is predicted as “has.” The baseline translation is “but, while congressional action cannot be agreed, several states are no longer waiting.” bs.: Bundesstaaten.

FIG. 19 shows a Chinese-to-English example in the dev set with anticipation. Both wait-1 and wait-3 policies yield perfect translations, with “making preparations” predicted well ahead of time. ⋄: continuous tense marker. †: +catchup, which produces slightly worse output and finishes ahead of the source sentence.

FIG. 20 shows a Chinese-to-English example from online news. The wait-3 model correctly anticipates both “expressed” and “welcome” (though missing “warm”), and moves the PP (“to . . . visit to china”) to the very end, which is fluent in the English word order.

FIG. 21 shows another Chinese-to-English example in the dev set. Again, both wait-1 and wait-3 correctly predicted “invitation” because the Chinese construction shown in FIG. 21 means “at the invitation of NP.” Furthermore, both predict “visit” (6 words ahead of time in wait-1), and wait-1 even predicts “Pakistan and India.” The baseline full-sentence translation is identical to that of the wait-1 policy. Abbreviations: invit.: invitation; pak.: Pakistani/Pakistan; ind.: Indian/India; govts: governments; mar.: March; &: and; †: +catchup, which produces the identical translation but predicts more ahead of time.

Except for example (b) in FIG. 22, wait-k models generally anticipate correctly, often producing translations as good as the full-sentence baseline. In FIG. 22, for example (a), both the verb “gǎndào” (“feel”) and the predicative “dānyōu” (“concerned”) are correctly anticipated, probably hinted by the word “missing.” †: +catchup. Example (b) shows that when the last word dānyōu is changed to bùmǎn (“dissatisfied”), the wait-3 translation result remains unchanged (correct for example (a) but incorrect for example (b)), whereas wait-5 translates conservatively and produces the correct translation without anticipation.

4. Human Evaluation on Anticipation

TABLE 4

                  zh→en                     en→zh
            k = 3   k = 5   k = 7     k = 3   k = 5   k = 7
sentence %     33      21       9        52      27      17
word %       2.51    1.49    0.56      5.76    3.35    1.37
accuracy %   55.4    56.3    66.7      18.6    20.9    22.2

                  de→en                     en→de
            k = 3   k = 5   k = 7     k = 3   k = 5   k = 7
sentence %     44      27       8        28       2       0
word %       4.50    1.50    0.56      1.35    0.10    0.00
accuracy %   26.0    56.0    60.0      10.7    50.0    n/a

Table 4 shows human evaluations of anticipation rates for sentences and words and of anticipation accuracy in all four directions, using 100 examples in each language pair from the dev sets. As shown, with increasing k, anticipation rates decrease (at both sentence and word levels), and anticipation accuracy improves. Moreover, anticipation rates differ greatly among the four directions, with

en→zh > de→en > zh→en > en→de

Interestingly, this order is exactly the same as the order of the BLEU-score gaps between full-sentence models and a wait-9 model according to various embodiments of the present disclosure:

en→zh: 2.0 > de→en: 1.1 > zh→en: 1.3† > en→de: 0.3

(†: difference in 4-ref BLEUs, which in experiments reduces by about half in 1-ref BLEUs). This order roughly characterizes the relative difficulty of simultaneous translation in these directions. As the example sentence in FIG. 23 demonstrates, en→zh translation is particularly difficult due to the mandatory long-distance reorderings of English sentence-final temporal clauses, such as “in recent years,” to much earlier positions in Chinese. It is also well known that de→en is more challenging in simultaneous translation than en→de, since SOV→SVO involves prediction of the verb, while SVO→SOV generally does not need prediction in wait-k models for relatively small k, e.g., k=3, because the V is often shorter than the O. For example, human evaluation found only 1.3%, 0.1%, and 0% word anticipations in en→de for k=3, 5, and 7, and 4.5%, 1.5%, and 0.6% for de→en.

H. Related Work

The work of Gu et al. (2017) may be distinguished from various embodiments in the present disclosure in a number of key aspects. For example, the full-sentence model (a) cannot anticipate future words; (b) cannot achieve any specified latency metric, unlike the wait-k model according to various embodiments that achieves a k-word latency; (c) is not a genuine simultaneous model but rather a combination of two models that uses a full-sentence base model to translate, thus creating a mismatch between training and testing; and (d) is also trained in two stages, using reinforcement learning (RL) to update the R/W model, unlike various embodiments in the present disclosure that are trained from scratch.

In a parallel work, some authors propose an “eager translation” model that outputs target-side words before the whole input sentence is fed into the model. However, that model has three major drawbacks. First, it aims to translate full sentences using beam search and is, therefore, not a simultaneous model. Second, it does not anticipate future words. Third, it uses word alignments to learn the reordering and achieves it in decoding by emitting an ε token. In contrast, various embodiments of the present disclosure integrate reordering into a single wait-k prediction model that is agnostic of, yet capable of, reordering.

One approach adds a prediction action to the architecture of Gu et al. (2017), but the encoder and decoder used are still trained on full sentences. Instead of predicting the source verb, which might come after several words, this approach predicts the immediately following source words, which is not particularly useful for SOV-to-SVO translation. In contrast, various embodiments presented herein predict directly on the target side, thus integrating anticipation into a single translation model.

I. Some Conclusions

Presented are prefix-to-prefix training and decoding framework embodiments for simultaneous translation with integrated anticipation, and embodiments of a wait-k policy that can achieve arbitrary word-level latency while maintaining high translation quality. These prefix-to-prefix architecture embodiments have the potential to be used in other sequence tasks outside of MT that involve simultaneity or incrementality.

J. Computing System Embodiments

Aspects of the present patent document are directed to information handling systems. For purposes of this disclosure, an information handling system may include any instrumentality or aggregate of instrumentalities operable to compute, calculate, determine, classify, process, transmit, receive, retrieve, originate, route, switch, store, display, communicate, manifest, detect, record, reproduce, handle, or utilize any form of information, intelligence, or data for business, scientific, control, or other purposes. For example, an information handling system may be a personal computer (e.g., desktop or laptop), tablet computer, mobile device (e.g., personal digital assistant or smart phone), server (e.g., blade server or rack server), a network storage device, or any other suitable device and may vary in size, shape, performance, functionality, and price. The information handling system may include random access memory (RAM), one or more processing resources such as a central processing unit (CPU) or hardware or software control logic, ROM, and/or other types of nonvolatile memory. Additional components of the information handling system may include one or more disk drives, one or more network ports for communicating with external devices as well as various input and output devices, such as a speaker, a microphone, a camera, a keyboard, a mouse, a touchscreen, and/or a video display. The information handling system may also include one or more buses operable to transmit communications between the various hardware components.

FIG. 24 depicts a simplified block diagram of a computing device/information handling system (or computing system) according to embodiments of the present disclosure. It will be understood that the functionalities shown for system 2400 may operate to support various embodiments of a computing system, although it shall be understood that a computing system may be differently configured and include different components, including having fewer or more components than depicted in FIG. 24.

As illustrated in FIG. 24, the computing system 2400 includes one or more central processing units (CPU) 2401 that provide computing resources and control the computer. CPU 2401 may be implemented with a microprocessor or the like, and may also include one or more graphics processing units 2419 and/or a floating-point coprocessor for mathematical computations. System 2400 may also include a system memory 2402, which may be in the form of random-access memory (RAM), read-only memory (ROM), or both.

A number of controllers and peripheral devices may also be provided, as shown in FIG. 24. An input controller 2403 represents an interface to various input device(s) 2404, such as a keyboard, mouse, touchscreen, and/or stylus. The computing system 2400 may also include a storage controller 2407 for interfacing with one or more storage devices 2408, each of which includes a storage medium such as magnetic tape or disk, or an optical medium that might be used to record programs of instructions for operating systems, utilities, and applications, which may include embodiments of programs that implement various aspects of the present invention. Storage device(s) 2408 may also be used to store processed data or data to be processed in accordance with the invention. The system 2400 may also include a display controller 2409 for providing an interface to a display device 2411, which may be a cathode ray tube, a thin film transistor display, an organic light-emitting diode display, an electroluminescent panel, a plasma panel, or another type of display. The computing system 2400 may also include one or more peripheral controllers or interfaces 2405 for one or more peripherals. Examples of peripherals may include one or more printers, scanners, input devices, output devices, sensors, and the like. A communications controller 2414 may interface with one or more communication devices 2415, which enables the system 2400 to connect to remote devices through any of a variety of networks, including the Internet, a cloud resource (e.g., an Ethernet cloud, a Fiber Channel over Ethernet/Data Center Bridging cloud, etc.), a local area network, a wide area network, a storage area network, or through any suitable electromagnetic carrier signals, including infrared signals.

In the illustrated system, all major system components may connect to a bus 2416, which may represent more than one physical bus. However, various system components may or may not be in physical proximity to one another. For example, input data and/or output data may be remotely transmitted from one physical location to another. In addition, programs that implement various aspects of the invention may be accessed from a remote location (e.g., a server) over a network. Such data and/or programs may be conveyed through any of a variety of machine-readable media including, but not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to store and execute program code, such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices, and ROM and RAM devices.

Aspects of the present invention may be encoded upon one or more non-transitory computer-readable media with instructions for one or more processors or processing units to cause steps to be performed. It shall be noted that the one or more non-transitory computer-readable media shall include volatile and non-volatile memory. It shall be noted that alternative implementations are possible, including a hardware implementation or a software/hardware implementation. Hardware-implemented functions may be realized using ASIC(s), programmable arrays, digital signal processing circuitry, or the like. Accordingly, the “means” terms in any claims are intended to cover both software and hardware implementations. Similarly, the term “computer-readable medium or media” as used herein includes software and/or hardware having a program of instructions embodied thereon, or a combination thereof. With these implementation alternatives in mind, it is to be understood that the figures and accompanying description provide the functional information one skilled in the art would require to write program code (i.e., software) and/or to fabricate circuits (i.e., hardware) to perform the processing required.

It shall be noted that embodiments of the present invention may further relate to computer products with a non-transitory, tangible computer-readable medium that have computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind known or available to those having skill in the relevant arts. Examples of tangible computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to store and execute program code, such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices, and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter. Embodiments of the present invention may be implemented in whole or in part as machine-executable instructions that may be in program modules that are executed by a processing device. Examples of program modules include libraries, programs, routines, objects, components, and data structures. In distributed computing environments, program modules may be physically located in settings that are local, remote, or both.

One skilled in the art will recognize that no computing system or programming language is critical to the practice of the present invention. One skilled in the art will also recognize that a number of the elements described above may be physically and/or functionally separated into sub-modules or combined together.

It will be appreciated by those skilled in the art that the preceding examples and embodiments are exemplary and not limiting to the scope of the present disclosure. It is intended that all permutations, enhancements, equivalents, combinations, and improvements thereto that are apparent to those skilled in the art upon a reading of the specification and a study of the drawings are included within the true spirit and scope of the present disclosure. It shall also be noted that elements of any claims may be arranged differently, including having multiple dependencies, configurations, and combinations.

What is claimed is:
 1. A method for low-latency translation, the method comprising: receiving in a chronological sequence, at a neural network, source language tokens of an input, the neural network having been trained to mimic translating in a chronological sequence by being trained to make a sequence of predictions of one or more target language tokens in order to translate a training input, which comprises a full sentence or an incomplete sentence, in which each prediction uses (1) a different sequence of one or more source language tokens from the training input and in which at least one or more of the different sequences used in predicting comprise fewer source language tokens than in the training input, and (2) one or more previously generated target language tokens, if any; using one or more received source language tokens, which form a prefix, and the neural network to predict and output one or more target language tokens; using the neural network, repeating until a stop condition is reached steps comprising: responsive to receiving a next source language token of the input that is not a source end-of-input signal: updating the prefix; using the updated prefix and one or more previously predicted target language tokens to predict a next target language token; and outputting the next target language token; and responsive to receiving the source end-of-input signal, using all of the source language tokens in the input or a subset thereof to generate any remaining target language tokens to complete translation of the input.
 2. The method according to claim 1, wherein using the one or more received source language tokens, which form the prefix, comprises using a monotonic non-decreasing function that defines, for each prediction step, which of the source language tokens form the prefix to be used for prediction.
 3. The method according to claim 1, wherein the step of updating the prefix comprises using a monotonic non-decreasing function that defines the source language tokens to be used to form the prefix for a prediction step.
 4. The method according to claim 1, wherein training the neural network comprises maintaining, for at least part of a prediction process for a training input, a ratio between a number of source language tokens received and a number of target language predictions made.
 5. The method according to claim 1, wherein the input is text received from a user interface or from an input audio stream that has been converted to text using automated speech recognition.
 6. The method according to claim 2, wherein the monotonic non-decreasing function that defines, for each prediction step, which of the source language tokens form the prefix to be used for prediction comprises waiting until a preset number k of source language tokens have initially been received before performing the step of using one or more received source language tokens, which form the prefix, and the neural network to predict one or more target language tokens.
 7. The method according to claim 1, wherein the neural network comprises an encoder and a decoder, the method further comprising: determining a decoding step of the decoder at which the encoder has received the source end-of-input signal; determining a number of tokens of the input that have been translated up to the decoding step; and using the number of tokens in the input that have been translated up to the decoding step as a measure of how much the decoder is out of synch with the input.
 8. A computer-implemented method for training a neural network for translating an input in a source language into an output in a target language, the method comprising: given an input in the source language, which has a corresponding ground-truth translation in the target language: receiving, at the neural network, source language tokens associated with the input, each source language token having an index value related to its order in the input; repeating, until a stop condition is reached, steps comprising: using a pre-defined function that is a function of the index value to identify a set of one or more source language tokens to be used for a prediction step, in which at least one set of one or more source language tokens used in a prediction step comprises fewer than all of the source language tokens associated with the input; using the identified set of one or more source language tokens and one or more previously predicted target language tokens, if any, to predict one or more target language tokens; adding the predicted one or more target language tokens to the output in the target language; and incrementing the index value; and responsive to a stop condition being reached, using a comparison of the output in the target language to its ground-truth translation in updating the neural network.
 9. The computer-implemented method according to claim 8, wherein a first set of one or more source language tokens defined by the pre-defined function comprises a predetermined number of initial source language tokens, which are used in predicting the first one or more target language tokens for the output.
 10. The computer-implemented method according to claim 9, wherein the pre-defined function comprises generating, for each successive source language token of a set of consecutive source language tokens, a subsequent target language token.
 11. The computer-implemented method according to claim 8, further comprising: responsive to reaching a cutoff point, applying a beam search or a full-input model to the source language tokens of the input to generate any remaining target language tokens.
 12. The computer-implemented method according to claim 11, wherein the cutoff point is a decoding step at an index when the input ends.
 13. The computer-implemented method according to claim 8, wherein the neural network comprises an encoder that receives a sequence of word embeddings for each word in the input and produces a corresponding sequence of hidden states, and the encoder is implemented using at least one of a recurrent neural network (RNN) model and a Transformer.
 14. The computer-implemented method according to claim 8, wherein the neural network does not generate a target end-of-input signal until after a source-side end-of-input signal has been received by the neural network.
 15. The computer-implemented method according to claim 13, wherein each source language token is constrained to attend only to its predecessors, thereby simulating an incremental encoder.
 16. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by at least one processor, cause steps to be performed comprising: receiving, at a neural network, a source language token of an input, in which source language tokens of the input are received in a chronological sequence and each source language token is associated with a number that represents its number in the chronological sequence of the source language tokens that have been received; using a pre-defined function, which is a function of the number associated with the received source language token, to identify a set of one or more source language tokens to be used for a prediction step, in which at least one set of one or more source language tokens used in a prediction step comprises fewer than all of the source language tokens associated with the input; responsive to all of the source language tokens in the identified set of one or more source language tokens having been received by the neural network and processed by an encoder of the neural network to produce a corresponding set of hidden states, using the corresponding hidden states of the identified set of one or more source language tokens and one or more previously generated target language tokens, if any, to predict one or more target language tokens; responsive to not all of the source language tokens in the identified set of one or more source language tokens having been received by the neural network, waiting until all of the source language tokens in the identified set of one or more source language tokens have been received by the neural network and processed by an encoder of the neural network to produce a corresponding set of hidden states, and then using the corresponding hidden states of the identified set of one or more source language tokens and one or more previously generated target language tokens, if any, to predict one or more target language tokens; adding the predicted one or more target language tokens to an output in the target language; responsive to a stop condition not being reached, returning to the step of receiving, at a neural network, a source language token of an input to receive a next source language token; and responsive to a stop condition being reached, concluding translation for the input.
 17. The non-transitory computer-readable medium or media of claim 16, wherein the at least one processor comprises a decoder that predicts a first set of target language tokens in response to receiving a first set of hidden states corresponding to a first set of source language tokens of the input.
 18. The non-transitory computer-readable medium or media of claim 16, wherein, in a training phase, the neural network was trained to mimic translating in a chronological sequence by being trained to make a sequence of predictions of one or more target language tokens in order to translate a training input, which comprises a full sentence or an incomplete sentence, in which each prediction uses (1) a sequence of one or more source language tokens from the training input and in which at least one or more of the sequences used in predicting comprise fewer source language tokens than in the training input, and (2) one or more previously generated target language tokens, if any.
 19. The non-transitory computer-readable medium or media of claim 18, wherein the encoder generates the sequence of hidden states by using either a recurrent neural network (RNN) or a Transformer.
 20. The non-transitory computer-readable medium or media of claim 16, wherein the source language tokens are received from text input or from an input audio stream that has been converted to source language tokens using automated speech recognition.
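
The following sketches are provided by way of illustration only and do not limit the claims. First, a minimal Python sketch of the wait-k prediction policy recited in, e.g., claims 1, 2, and 6, under the assumption of a hypothetical model interface predict_next(source_prefix, target_prefix) that returns the next target token; the names translate_wait_k and EOS are likewise conventions of this sketch rather than of the disclosure.

    # Illustrative sketch only: wait-k decoding with a hypothetical
    # next-token interface predict_next(source_prefix, target_prefix).
    EOS = "</s>"  # assumed end-of-input marker for both languages

    def translate_wait_k(source_stream, predict_next, k):
        src, tgt = [], []
        for token in source_stream:  # source tokens arrive chronologically
            if token == EOS:
                break  # source end-of-input signal received
            src.append(token)
            # Predict while the next step's prefix g(t) = k + t - 1,
            # with t = len(tgt) + 1, is fully available.
            while k + len(tgt) <= len(src):
                tgt.append(predict_next(src[:k + len(tgt)], tgt))
                yield tgt[-1]
        # Tail: after the end-of-input signal, complete the translation
        # from the full source (final clause of claim 1).
        limit = 2 * len(src) + 10  # crude safety bound for this sketch
        while (not tgt or tgt[-1] != EOS) and len(tgt) < limit:
            tgt.append(predict_next(src, tgt))
            yield tgt[-1]

In this sketch the first target token is emitted once k source tokens have arrived, and the output thereafter stays exactly k tokens behind the source until the tail.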
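
Second, claim 7 recites measuring how far the decoder is out of synch with the input at the decoding step where the source end-of-input signal arrives. One plausible formalization of such a lag measure is an average-lagging style of metric; the sketch below, including the name average_lagging and the callable g, is an assumption of this illustration, not a definitive statement of the claimed measure.

    def average_lagging(g, src_len, tgt_len):
        # g(t): number of source tokens read before predicting target
        # token t (1-indexed); r compensates for length mismatch.
        r = tgt_len / src_len
        # tau: first decoding step at which the encoder has received
        # the entire source (cf. the end-of-input step of claim 7).
        tau = next((t for t in range(1, tgt_len + 1) if g(t) >= src_len),
                   tgt_len)
        # Average, over the first tau steps, of how many source tokens
        # the decoder has consumed beyond an ideally synchronized one.
        return sum(g(t) - (t - 1) / r for t in range(1, tau + 1)) / tau

For a wait-k policy with equal source and target lengths (r = 1), this measure evaluates to k, matching the intended k-token lag.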
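
Third, a sketch of the prefix-to-prefix training recited in claim 8. The interfaces are again hypothetical: prob_next(source_prefix, target_prefix, y) stands in for the model probability of y as the next target token, and the resulting loss would be minimized by whatever gradient-based update the chosen framework provides, which realizes the claimed comparison against the ground-truth translation.

    import math

    def prefix_to_prefix_loss(src_tokens, tgt_tokens, prob_next, k):
        # Cross-entropy in which ground-truth token y_t conditions on the
        # source prefix defined by g(t) = min(k + t - 1, |x|), so at least
        # one prediction step sees fewer than all source tokens (claim 8).
        loss = 0.0
        for t, y in enumerate(tgt_tokens, start=1):
            g_t = min(k + t - 1, len(src_tokens))
            loss -= math.log(prob_next(src_tokens[:g_t], tgt_tokens[:t - 1], y))
        return loss

Because some prediction steps see an incomplete source, minimizing this loss trains the model to anticipate not-yet-received content rather than wait for it.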
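
Finally, claims 13 and 15 contemplate a Transformer-style encoder in which each source token attends only to its predecessors so that the encoder may be run incrementally. Assuming a standard self-attention encoder, one common realization, sketched here for illustration, is a lower-triangular attention mask; the helper name incremental_attention_mask is hypothetical.

    def incremental_attention_mask(n):
        # mask[i][j] is True when source position i may attend to source
        # position j; permitting only j <= i means a token's hidden state
        # never depends on later-arriving tokens (cf. claim 15).
        return [[j <= i for j in range(n)] for i in range(n)]

With such a mask, hidden states computed for a source prefix remain valid as further tokens arrive, so each newly received token requires computing only its own state.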